Learning from Data Sets with Missing Labels
Author
Abstract
This paper considers the task of learning discriminative classifiers from data when some class labels are missing from the data set (so-called "semi-supervised" learning), specifically when the labeled data are not drawn from the same distribution as the unlabeled data. This is an important issue in domains in which learning from only the labeled samples can result in a classifier that is not appropriate for the distribution of data to which it is to be applied. For example, lending institutions create models of who is likely to repay a loan from training sets consisting of people in their records who were given loans in the past; however, the institution only approved the loan applications of those it judged likely to repay. Learning from only approved loans yields an incorrect model because the training set is a biased sample of the general population of applicants, which is the population in which the model is to be used. Semi-supervised learning attempts to overcome this bias by including the unlabeled data in the learning process. This paper systematically explores the different types of bias that can arise in a semi-supervised setting, with examples of real-world situations. We use Bayesian networks to formalize each type of bias as a set of conditional independence relationships, and for each case we present an overview of available learning algorithms. These algorithms have been published in separate fields of research, including epidemiology, medical observational studies, econometrics, sociology, and credit scoring.

1 Semi-supervised learning

Semi-supervised classifier learning is learning from a data set in which only some of the samples have class labels, which can occur for a variety of reasons. For example, this arises when building a model of whose loan applications to approve, or more precisely, a model of applicants' default/repayment behavior (e.g. [4], [6], [7], [9]). When people apply for a loan, their application is either accepted or rejected, depending on the lender's guess as to how likely the applicant is to repay the loan. The people whose applications were accepted then either eventually repay the loan or default on it, which defines the two classes (good borrowers/bad borrowers). We would like to use a data-mining model to predict how likely a person is to repay a loan, so we can better decide whom to accept or reject. We learn the parameters of the model from a database created by a financial institution; however, such databases record repay/default behavior only for the people whose applications were accepted, since the rejected applicants never had a chance to repay or default on a loan. The accepts clearly constitute a biased sample of all the applicants, so using traditional supervised learning algorithms on only the labeled samples could lead to a biased model. We must use a learning algorithm that takes both the labeled and unlabeled samples into account. Using such semi-supervised ...
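The accept/reject scenario above can be mimicked in a few lines of code. The sketch below is purely illustrative and is not one of the algorithms surveyed in the paper: the data are synthetic, the feature names are hypothetical, and the semi-supervised method shown is a generic self-training wrapper from scikit-learn. Rejected applicants are kept in the training set as unlabeled samples (label -1), so the learner sees both the labeled accepts and the unlabeled rejects.

```python
# Illustrative sketch only: synthetic applicants, hypothetical features,
# and a generic self-training classifier (not the paper's method).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)

# Hypothetical applicant features: column 0 = income, column 1 = debt ratio.
X = rng.normal(size=(1000, 2))
y = ((X[:, 1] - X[:, 0] + rng.normal(scale=0.5, size=1000)) > 0).astype(int)

# Simulate the lender's past accept/reject rule: only low-debt applicants were
# approved, so only their repay/default outcome was ever recorded.
accepted = X[:, 1] < 0.0
y_observed = np.where(accepted, y, -1)  # -1 marks a missing label (a reject)

# Supervised baseline: trained on the accepted (labeled) applicants only.
labeled_only = LogisticRegression().fit(X[accepted], y[accepted])

# Semi-supervised alternative: self-training also consumes the unlabeled rejects.
semi = SelfTrainingClassifier(LogisticRegression(), threshold=0.8)
semi.fit(X, y_observed)

print("all-applicant accuracy, labeled-only model :",
      accuracy_score(y, labeled_only.predict(X)))
print("all-applicant accuracy, self-training model:",
      accuracy_score(y, semi.predict(X)))
```

In this sketch the labeled sample is deliberately biased toward low-debt applicants, which is exactly the accept-only bias the abstract describes; whether a given semi-supervised method actually corrects such bias depends on the type of missingness, which is what the paper's taxonomy of biases addresses.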
Similar resources
Spectral Label Refinement for Noisy and Missing Text Labels
With the recent growth of online content on the Web, there is more and more user-generated data with noisy and missing labels, e.g., social tags and voted labels from Amazon's Mechanical Turk. Most machine learning methods require accurate label sets and cannot be trusted when the labels are unreliable. In this paper, we provide a text label refinement algorithm to adjust the ...
Improving Multilabel Classification by Avoiding Implicit Negativity with Incomplete Data
Many real world problems require multi-label classification, in which each training instance is associated with a set of labels. There are many existing learning algorithms for multi-label classification; however, these algorithms assume implicit negativity, where missing labels in the training data are automatically assumed to be negative. Additionally, many of the existing algorithms do not h...
Conditional Restricted Boltzmann Machines for Multi-label Learning with Incomplete Labels
Standard multi-label learning methods assume fully labeled training data. This assumption however is impractical in many application domains where labels are difficult to collect and missing labels are prevalent. In this paper, we develop a novel conditional restricted Boltzmann machine model to address multi-label learning with incomplete labels. It uses a restricted Boltzmann machine to captu...
Multi-label learning with missing labels for image annotation and facial action unit recognition
Many problems in computer vision, such as image annotation, can be formulated as multi-label learning problems. It is typically assumed that the complete label assignment for each training image is available. However, this is often not the case in practice, as many training images may only be annotated with a partial set of labels, either due to the intensive effort to obtain the fully labeled ...
Semi-Supervised Learning with Adversarially Missing Label Information
We address the problem of semi-supervised learning in an adversarial setting. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily, which is motivated by several practical examples. We present nearly matching upper and lower generalization bounds for learning in this setting under reasona...
Scalable Generative Models for Multi-label Learning with Missing Labels
We present a scalable, generative framework for multi-label learning with missing labels. Our framework consists of a latent factor model for the binary label matrix, which is coupled with an exposure model to account for label missingness (i.e., whether a zero in the label matrix is indeed a zero or denotes a missing observation). The underlying latent factor model also assumes that the low-di...
Journal:
Volume / Issue:
Pages: -
Publication date: 2005